Add AnyFlow Any-Step Video Diffusion Pipelines (Bidirectional + FAR Causal)#13745
Conversation
…vel imports This is the lazy-loader scaffolding only. Body files (pipeline_anyflow.py, pipeline_anyflow_causal.py, transformer_anyflow.py, scheduling_flow_map_euler_discrete.py) come in subsequent commits.
The flow-map scheduler advances samples from timestep t to caller-provided target r in a single Euler step, supporting any-step sampling on flow-map- distilled checkpoints. It is a general-purpose scheduler — not specific to the AnyFlow checkpoints. Tests: 12 standalone tests covering instantiation, set_timesteps endpoints, shift identity/monotonicity, step shape preservation, zero-interval identity, one-shot sampling, train weight schemes, scale_noise endpoints. Docs: api/schedulers/flow_map_euler_discrete.md
A 3D DiT extending the v0.35.1 Wan2.1 backbone with two config-toggled modules: * FAR causal blocks (init_far_model=True): block-sparse causal attention via flex_attention + compressed-frame patch embedding for frame-level autoregressive generation (Gu et al., 2025, arXiv:2503.19325). * Dual-timestep flow-map embedding (init_flowmap_model=True): adds a delta timestep embedder enabling flow-map sampling z_t -> z_r over arbitrary intervals (AnyFlow). With both flags off, the model reduces to stock Wan2.1. The class is intentionally self-contained rather than annotated with '# Copied from diffusers.models.transformers.transformer_wan' because upstream Wan has been refactored extensively since v0.35.1 (new WanAttention class, different processor architecture). Tests: 9 unit tests covering construction in 3 modes, bidi forward shape and determinism, return_dict variants, save/load round-trip with and without init_far_model, gradient checkpointing toggle. Docs: api/models/anyflow_transformer3d.md
* AnyFlowPipeline (pipeline_anyflow.py, ~590 LOC): bidirectional T2V using
flow-map sampling. Loads checkpoints from nvidia/AnyFlow-Wan2.1-T2V-{1.3B,14B}.
* AnyFlowCausalPipeline (pipeline_anyflow_causal.py, ~700 LOC): FAR-based
causal pipeline supporting T2V/I2V/TV2V via task_type kwarg. Loads checkpoints
from nvidia/AnyFlow-FAR-Wan2.1-{1.3B,14B}-Diffusers.
Both pipelines reuse stock WanLoraLoaderMixin, AutoencoderKLWan, UMT5EncoderModel,
and AutoTokenizer from upstream. The transformer is the AnyFlowTransformer3DModel
introduced in the previous commit. The scheduler is FlowMapEulerDiscreteScheduler.
Tests:
* tests/pipelines/anyflow/test_anyflow.py: PipelineTesterMixin fast tests +
slow integration test against nvidia/AnyFlow-Wan2.1-T2V-1.3B-Diffusers.
* tests/pipelines/anyflow/test_anyflow_causal.py: same structure for FAR variant.
Reference slices for slow integration tests are deferred to Phase 7
(Final quality pass) where the user runs them on a real GPU.
Modeled on the Helios pipeline doc (PR huggingface#13208). Sections: paper link + abstract, supported checkpoints table, memory/speed optimization tabs, T2V/I2V/TV2V examples for both bidirectional and causal variants, autodoc trailers.
…ersion script * Register AnyFlowPipeline in AUTO_TEXT2VIDEO_PIPELINES_MAPPING. * AnyFlowCausalPipeline is intentionally NOT registered for AutoPipeline because its task switch (t2v / i2v / tv2v) is too rich for a single auto-resolve key. * scripts/convert_anyflow_to_diffusers.py: convert .pt training checkpoints (with 'ema' state dict) into a diffusers save_pretrained layout. Supports all 4 released NVIDIA AnyFlow variants. Replaces the omegaconf-based config in the upstream repo with argparse to match other diffusers conversion scripts.
* ruff format pass on all 5 source files (long lines + trailing comma fixes) * check_dummies.py --fix_and_overwrite regenerated: - dummy_pt_objects.py: AnyFlowTransformer3DModel + FlowMapEulerDiscreteScheduler - dummy_torch_and_transformers_objects.py: AnyFlowPipeline + AnyFlowCausalPipeline Local fast tests: 21/21 passed - 12 scheduler tests (FlowMapEulerDiscreteScheduler) - 9 transformer tests (AnyFlowTransformer3DModel construction + bidi forward + save/load) The pipeline fast tests in tests/pipelines/anyflow/ require a local dev install that matches the diffusers main branch's transformers >= compatibility floor. The reference slices for slow integration tests (real GPU + 1.3B/14B checkpoints) are intentionally left as TODO stubs to be captured by the user on a real GPU machine before opening the PR.
…torials
Critical bug fixes (verified against precision-validation review):
* pipeline_anyflow.py / pipeline_anyflow_causal.py: replace hardcoded
transformer_dtype = torch.bfloat16 with self.transformer.dtype, so
pipe.to("cpu") and PipelineTesterMixin save/load tests do not crash on a
dtype mismatch in the patch_embedding conv3d.
* transformer_anyflow.py: drop the duplicate `base = base = ...` assignment in
_build_causal_mask (was a copy-paste typo carried over from FAR-Dev).
* transformer_anyflow.py: drop unused `q_is_context` / `k_is_context` locals
and the `# noqa: F841` markers that were silencing the dead-store warning.
* transformer_anyflow.py: remove `CacheMixin` from the inheritance list — the
pipeline manages KV cache directly, the mixin's interface is unused.
* transformer_anyflow.py: guard the module-level `torch.compile(flex_attention)`
with try/except so the file imports cleanly on CPU CI / no-Triton machines.
* convert_anyflow_to_diffusers.py: replace ad-hoc print warnings with the
stdlib logger (warning_once-style) and a module-level basicConfig.
Documentation accuracy:
* AnyFlowCausalPipeline class docstring + main pipeline doc + EN/ZH tutorial:
drop the fictitious `task_type` / `image` / `video` arguments and document
the real API: pass `context_sequence={"raw": tensor}` (or `{"latent": ...}`)
to switch between T2V (None) / I2V (1-frame) / TV2V (4n+1-frame) modes.
* Pipeline class docstrings + main doc: explicitly describe AnyFlow's
two-stage LoRA distillation including DMD reverse-divergence supervision
with Flow-Map backward simulation in stage 2 (was previously implicit).
* training_rollout: add detailed docstring explaining its role as the
3-segment Flow-Map backward simulation entry point used during DMD training.
* Long-form tutorial doc `using-diffusers/anyflow.md` (EN, 239 LOC) and
Chinese mirror `docs/source/zh/using-diffusers/anyflow.md` (224 LOC) added
and registered in both `_toctree.yml` files.
Tests:
* Skip `test_attention_slicing_forward_pass` in both pipeline test classes
with a clear rationale (custom attention processor does not support slicing).
* All 21 standalone tests still pass (12 scheduler + 9 transformer).
Quality gates:
* `ruff check` clean across all AnyFlow files.
* `ruff format --check` reports 6 files already formatted.
* `python utils/check_copies.py` reports no diff.
Out of scope for this commit (deferred until reviewer feedback):
* Splitting AnyFlowTransformer3DModel into bidi + causal subclasses
* Unifying _forward_inference / _forward_cache return types
* Migrating model tests from plain unittest to BaseModelTesterConfig + mixins
* HF model card / config.json metadata updates on the nvidia/* repos
(push to Hub manually before opening the PR)
… output
Round 2 of review feedback. Three groups of changes; transformer state-dict
keys, module hierarchy, and tensor flow are unchanged so the H200 bit-exact
validation remains valid.
A. Pipeline rename (mechanical, no behavior change):
* Class: AnyFlowCausalPipeline -> AnyFlowFARPipeline (Causal in diffusers
usually means an attention mask; AnyFlow's variant is FAR autoregressive,
so the FAR name is more specific and matches the paper).
* File: pipeline_anyflow_causal.py -> pipeline_anyflow_far.py (git mv).
* Test file: test_anyflow_causal.py -> test_anyflow_far.py (git mv).
* All references updated in src/, tests/, docs/, scripts/, plus stale
anyflowcausalpipeline anchor links in tutorial markdown.
B. Pipeline test bug fixes (closes 19 fast-test failures reported by
precision-validation reviewer):
* pipeline_anyflow.py / pipeline_anyflow_far.py: __call__ now sets
self._num_timesteps = num_inference_steps before the rollout, so the
PipelineTesterMixin callback tests can read pipe.num_timesteps.
* tests/pipelines/anyflow/test_anyflow_far.py: drop the fictitious
task_type="t2v" kwarg that crashed every causal fast test (the FAR
pipeline selects mode via context_sequence, not a task_type arg).
C. Transformer architecture cleanups (review-driven, no tensor changes):
* Replace forward(*args, **kwargs) dispatcher with an explicit signature
listing every supported kwarg (hidden_states, timestep, r_timestep,
encoder_hidden_states, encoder_hidden_states_image, chunk_partition,
clean_hidden_states, clean_timestep, kv_cache, kv_cache_flag, is_causal,
attention_kwargs, return_dict). Helps IDE / type-checker / torch.compile
tracing.
* Drop SimpleNamespace returns. Add AnyFlowFARTransformerOutput
(BaseOutput dataclass with sample + kv_cache fields) for the two causal
paths that need to also propagate kv_cache (_forward_inference and the
newly return_dict-aware _forward_cache). _forward_train and
_forward_bidirection now consistently return Transformer2DModelOutput.
Pipeline call sites already use return_dict=False with positional
unpacking, so the fix is transparent there.
Out of scope (deferred until canonical-org HF metadata sync):
* Splitting AnyFlowTransformer3DModel into a bidi class plus an
AnyFlowFARTransformer3DModel subclass — touches register_to_config keys
and would require updating model_index.json on every released checkpoint.
* Promoting chunk_partition from register_to_config to a forward-time
argument (same reason).
* Renaming training_rollout to _denoise — would break callers in the
FAR-Dev on-policy trainer that produced the released checkpoints.
Local fast tests: 21/21 still pass (12 scheduler + 9 transformer).
ruff check, ruff format, and check_copies.py are all clean.
…nk_partition to FAR fast-test fixture
Two root causes for the 19 remaining PipelineTesterMixin failures, identified
by the H200 reviewer:
1. callback_on_step_end was accepted by __call__ but never invoked. Both
pipelines pass it through to training_rollout (and FAR additionally through
inference()), and inference_range now fires it after scheduler.step in
the standard inference branch:
if callback_on_step_end is not None:
callback_kwargs = {k: locals()[k] for k in callback_on_step_end_tensor_inputs}
callback_outputs = callback_on_step_end(self, i, t, callback_kwargs)
latents = callback_outputs.pop("latents", latents)
prompt_embeds = ...
negative_prompt_embeds = ...
`nonlocal prompt_embeds, negative_prompt_embeds` lets the callback rewrite
the closure-captured embeddings, matching upstream WanPipeline semantics.
The 3-segment grad_timestep training rollout does not invoke the callback;
it is intentionally training-only.
2. tests/pipelines/anyflow/test_anyflow_far.py::get_dummy_components built
the dummy transformer without a `chunk_partition`, leaving it None on the
model config and crashing the pipeline at `sum(self.transformer.config.chunk_partition)`.
Set `chunk_partition=[1, 1, 1]` in the fixture (3 chunks of 1 latent frame
each, matching the test's num_frames=9 -> 3 latent frames).
Local fast tests: 21/21 still pass.
ruff check, ruff format, and check_copies.py are all clean.
…ig + rename helpers
Major architectural refactor that aligns the integration with diffusers conventions
ahead of the canonical-org Hub upload. State-dict keys, module hierarchy, and
tensor flow are unchanged so the H200 bit-exact validation remains valid; only
the on-disk transformer/config.json fields move.
Changes:
1. **Sibling transformer classes** replace the flag-driven single class:
* AnyFlowTransformer3DModel — bidirectional only. Drops compressed_patch_size /
full_chunk_limit / init_far_model / init_flowmap_model / chunk_partition
kwargs (always-on for AnyFlow distilled checkpoints).
* AnyFlowFARTransformer3DModel — adds far_patch_embedding + the 3 FAR forward
paths (train / cache-prefill / autoregressive inference).
* AnyFlowTimeTextImageEmbedding (the legacy single-time embedder used only by
the old setup_flowmap_model bootstrap) is removed; both classes now build
AnyFlowDualTimestepTextImageEmbedding directly in __init__.
* setup_flowmap_model / setup_far_model methods are removed; weight warm-start
for far_patch_embedding (trilinear interpolation from patch_embedding) moves
into AnyFlowFARTransformer3DModel.__init__.
2. **chunk_partition** is no longer a model config field. The FAR pipeline owns
the schedule:
* AnyFlowFARPipeline.default_chunk_partition = [1, 3, 3, 3, 3, 3, 3, 2]
matches the released 81-frame NVIDIA checkpoints.
* AnyFlowFARPipeline.__call__ / _denoise_rollout accept a chunk_partition
argument that overrides the default for non-default num_frames.
3. **training_rollout -> _denoise_rollout** rename across both pipelines and all
English / Chinese docs that referenced it. Signals the method is internal to
the pipeline driver, not a public training API.
4. **Conversion script + tests + docs + registries**:
* scripts/convert_anyflow_to_diffusers.py: VARIANTS dict picks the right
transformer class per variant; init_far_model / init_flowmap_model /
chunk_partition kwargs are removed from the from_pretrained call.
* Transformer test file split into AnyFlowTransformer3DModelTest and
AnyFlowFARTransformer3DModelTest classes.
* Pipeline test fixtures use the right class and pass chunk_partition via
get_dummy_inputs (3-frame schedule [1, 1, 1] for the 9-frame test).
* New docs page docs/source/en/api/models/anyflow_far_transformer3d.md;
anyflow_transformer3d.md rewritten for the bidi-only class.
* AnyFlowFARTransformer3DModel registered in src/diffusers/__init__.py,
src/diffusers/models/__init__.py, models/transformers/__init__.py and the
dummy_pt_objects.py stubs.
* docs/source/en/_toctree.yml: new entry for the FAR transformer page.
5. **Cleanups**:
* Pipeline __call__ no longer passes is_causal=False to the bidi forward (the
bidi class doesn't accept it).
* Pipeline class docstrings drop stale references to init_*_model flags.
Local tests: 22/22 pass (12 scheduler + 10 transformer covering both classes).
ruff check / format / check_copies clean.
Hub artifacts (model_index.json, transformer/config.json, scheduler config) need
to be regenerated for the released checkpoints; the HF update guide will be
delivered separately.
…models.md Hard violations (per official diffusers guidelines): * drop einops dependency — replace 25+ rearrange() calls with native permute/reshape/unflatten in transformer + both pipelines * device-gate torch.float64 — apply_rotary_emb and AnyFlowRotaryPosEmbed now fall back to float32 / complex64 on MPS / NPU; freqs are lazily rebuilt per-device via _build_freqs (matches transformer_wan / transformer_flux pattern) * migrate attention to dispatch_attention_fn — replace direct F.scaled_dot_product_attention calls with dispatch_attention_fn (works with sage / flash / native backends); introduce AnyFlowAttention( AttentionModuleMixin) with _default_processor_cls / _available_processors; rename processors to AnyFlowAttnProcessor / AnyFlowCrossAttnProcessor and declare _attention_backend / _parallel_config class attrs * drop dead config fields — qk_norm and added_kv_proj_dim are pruned from both transformer __init__ signatures and AnyFlowTransformerBlock; AnyFlowAttention is hardcoded to rms-norm-across-heads (the only scheme the released checkpoints use) and has no add_k_proj path (T2V only) * add _repeated_blocks = ["AnyFlowTransformerBlock"] to both transformer classes for compile_repeated_blocks() support (matches Wan) * annotate prepare_latents with `# Copied from diffusers.pipelines.wan. pipeline_wan.WanPipeline.prepare_latents`; the pipeline-side rearrange to (B, T, C, H, W) layout is moved to the call site State-dict keys are preserved (legacy Attention had identical to_q / to_k / to_v / to_out / norm_q / norm_k naming), so existing AnyFlow checkpoints load bit-exactly into the new AnyFlowAttention class. The HF Hub config-update guide is updated correspondingly: transformer/ config.json now drops qk_norm and added_kv_proj_dim alongside the previous init_far_model / init_flowmap_model / chunk_partition removals. 22 fast CPU tests still pass; ruff format / ruff check / check_copies all clean.
…/head-dim fallbacks + KV-cache dtype + num_timesteps
Phase 3 migrated bidi + cross-attention to dispatch_attention_fn but the FAR
causal path still calls flex_attention directly, which has hard requirements
(CPU compile, head_dim >= 16) that fail on PipelineTesterMixin's tiny dummy
components. Real ckpts (head_dim=128, CUDA) never hit these branches; bit-exact
numerical equivalence with FAR-Dev preserved on all 4 released ckpts (forward
0.00e+00, backward kernel-nondet only, ratio 1.000).
Code fixes:
1. AnyFlowRotaryPosEmbed._forward_compressed_frame / _forward_full_frame now
short-circuit to an empty tensor when num_frames / height / width is 0.
PipelineTesterMixin's dummy VAE has scale_factor_spatial=8, so a 16x16 raw
spatial input becomes a 2x2 latent which then floors to 0 against
compressed_patch_size=(1, 4, 4); the original
`freqs[:0].view(0, k, 1, -1)` reshape was ambiguous in that regime.
2. flex_attention dispatch: split the module-load
`torch.compile(flex_attention, dynamic=True)` into `_flex_attention_eager`
(always available) plus `_flex_attention_compiled`, with a tiny wrapper
that picks compiled for CUDA tensors and eager for CPU. Avoids
torch._inductor C++ codegen failures that broke fast tests after
`pipe.to("cpu")`. CUDA performance unchanged (L10 benchmark: 0.0% delta on
bidi 1.3B fwd, 0.0% delta on FAR causal 1.3B fwd).
3. AnyFlowAttnProcessor (FAR causal branch): when head_dim < 16
(flex_attention's hard minimum) zero-pad q/k/v's last dim to 16 and pass
`scale=1/sqrt(original_head_dim)` to flex_attention. Padded value rows
contribute 0, so trimming the output back is mathematically equivalent.
Released ckpts use head_dim=128 so the branch is never taken in production.
4. pipeline_anyflow_far.encode_kv_cache: replace the hardcoded
`latents.to(torch.bfloat16)` with `self.transformer.dtype`. The hardcoded
bf16 crashed conv3d on dummy fp32 components ("Input type (BFloat16) and
bias type (float) should be the same"); real bf16 ckpts are unaffected.
5. pipeline_anyflow_far._denoise_rollout sets
`self._num_timesteps = (len(chunk_partition) - num_context_chunks) * num_inference_steps`
before the chunk loop, so PipelineTesterMixin.test_callback_cfg's
`pipe.num_timesteps`-based assertion matches the actual number of callback
fires (chunks * NFE) instead of the previous hardcoded num_inference_steps.
Tests:
* test_callback_inputs cannot pass without changing FAR's chunk-wise output
semantics — it zeroes latents on the final step and asserts the *entire*
output buffer is zero, but only the active chunk's slice is overwritten in
a chunk-wise rollout. Marked `@unittest.skip` with a detailed rationale;
callback functionality itself is still covered by test_callback_cfg.
* Full pytest run on tests/pipelines/anyflow/ +
tests/models/transformers/test_models_transformer_anyflow.py +
tests/schedulers/test_scheduler_flow_map_euler_discrete.py: 81 passed,
0 failed, 11 skipped.
Quality gates:
* `ruff check` and `ruff format --check` clean across all AnyFlow files.
* `python utils/check_copies.py` clean.
* `python utils/check_dummies.py` clean.
User-facing alignment with the official HF Hub model card and the day-of-announcement materials at https://huggingface.co/collections/nvidia/anyflow. * Fill in the arXiv identifier 2605.13724 (5 paper links + 2 BibTeX entries). * Rename TV2V → V2V across docs + pipeline_anyflow{,_far}.py so the diffusers copy uses the same Video-to-Video terminology as the official model card. * Add the [nvidia/anyflow](https://huggingface.co/collections/nvidia/anyflow) HF collection link to the three tutorial intros. * Drop the temporary "guyuchao/* staging" tip from the EN tutorial / API page / ZH tutorial — the nvidia/AnyFlow-*-Diffusers repos are now live. * Wire up NVlabs/AnyFlow (training code) and nvlabs.github.io/AnyFlow (project page) in place of the prior <github-org> / <project-page-url> placeholders. * Cite the authors (Yuchao Gu, Guian Fang et al.) and NUS ShowLab × NVIDIA affiliation in the main tutorial, API pipeline page, and both transformer model pages; BibTeX uses the standard `and others` to elide the full list until the next pass. Working tree, CI gates, and tests after the change: ruff format --check ✓ ruff check ✓ python utils/check_copies.py ✓ python utils/check_dummies.py ✓ pytest tests/models + tests/schedulers (22 fast) ✓ No production code logic changes — only docstring wording inside pipeline files (TV2V → V2V).
Replace the placeholder ``@article{gu2026anyflow, author = {Gu, Yuchao and
Fang, Guian and others}, ...}`` block in both the English and Chinese
tutorials with the canonical ``@misc{gu2026anyflowanystepvideodiffusion,
...}`` form from arxiv.org/abs/2605.13724, which lists all seven authors:
Yuchao Gu, Guian Fang, Yuxin Jiang, Weijia Mao, Song Han, Han Cai,
Mike Zheng Shou.
Docs-only.
Scheduler - FlowMapEulerDiscreteScheduler.step now returns a FlowMapEulerDiscreteSchedulerOutput dataclass (or tuple with return_dict=False) and uses the conventional positional order (model_output, timestep, sample, r_timestep). - Drop training-only helpers: adaptive_weighting, set_train_weight, get_train_weight, linear_timesteps_weights, and the weight_type config field. - Add scale_model_input no-op for API parity; raise ValueError on missing r_timestep. Transformer - Remove gate_track debug write inside AnyFlowDualTimestepTextImageEmbedding.forward_timestep. - Compile flex_attention lazily on first CUDA call instead of at import time. - Replace assert with ValueError in build_block_mask. - Resolve <arxiv-id> placeholders to 2605.13724. Pipelines (AnyFlowPipeline + AnyFlowFARPipeline) - Add EXAMPLE_DOC_STRING + @replace_example_docstring and full __call__ docstrings covering every argument. - Move use_mean_velocity from __init__ to __call__ so save/load round-trips. - Drop _denoise_rollout's grad_timestep branch (DMD on-policy training rollout), the inner inference_range closure, and the redundant negative-prompt concat. - Replace asserts with ValueError; wire show_progress to tqdm; rename inference -> _inference; remove dead current_timestep property. - Update scheduler.step call sites to the new signature. - Trim class docstrings to inference-only language. Pipeline output - Add Apache 2.0 license header; switch to relative import. Auto pipeline / conversion script - Register AnyFlowFARPipeline in AUTO_IMAGE2VIDEO_PIPELINES_MAPPING and AUTO_VIDEO2VIDEO_PIPELINES_MAPPING. - Document the weights_only=False requirement in the conversion script. Tests - Scheduler tests use the new step signature and verify the Output dataclass contract. - Drop the four obsolete training-weight tests; drop weight_type kwarg from pipeline test fixtures; remove internal milestone names from TODO comments. Docs - Resolve <arxiv-id> in the scheduler docs page. - Trim DMD / on-policy distillation language in EN/ZH tutorials and the pipelines page; the paper abstract quote is preserved verbatim.
| # Torch-compile mixin intentionally skipped: FAR's `_build_causal_mask` uses | ||
| # `flex_attention.create_block_mask(_compile=False)`, which conflicts with the tracer | ||
| # assumptions made by the standard TorchCompileTesterMixin. The bidi transformer test file | ||
| # covers compile behavior; the FAR causal path is bit-exact-validated end-to-end on H200 | ||
| # through the pipeline replay rather than per-module compile. |
There was a problem hiding this comment.
Suggestion (non-blocking): my understanding is that the underlying cause of the incompatibility between AnyFlowFARTransformer3DModel and TorchCompileTesterMixin is that AnyFlowFARTransformer3DModel.forward calls torch.nn.attention.flex_attention.create_block_mask (via _build_causal_mask) internally. Since _build_causal_mask doesn't depend on the transformer internals, we could refactor this to be a standalone function and build the attention mask outside of the transformer forward method (e.g. in AnyFlowFARPipeline.__call__) and then pass it to forward via a attention_mask: BlockMask argument. This should allow pipe.transformer.compile() (and the compile tests) to work as expected.
There was a problem hiding this comment.
Thanks — this is a good direction. Since you marked it non-blocking and it reshapes the transformer's public attention_mask contract (the pipeline becomes responsible for building the BlockMask, which needs another bit-exact pass to validate), I'd like to defer it to a focused follow-up PR that pairs the _build_causal_mask extraction with re-enabling TorchCompileTesterMixin on the FAR transformer — that way the optimization and its dedicated test land in one go. Will track it as a TODO post-merge.
dg845
left a comment
There was a problem hiding this comment.
Thanks for the changes! I think this PR is close to merge. Left some small comments and suggestions.
|
Thanks for the careful third pass @dg845 — happy to hear we're close. Working through all 9 now; will reply per-thread as each lands. Should be done in ~1h. |
…imesteps schedule dg845 third pass — 7 of 9 comments applied; the 8th (custom sigmas/timesteps support) matches FlowMatchEulerDiscreteScheduler conventions; the 9th (_build_causal_mask refactor) is explicitly marked non-blocking and deferred to a follow-up that also re-enables TorchCompileTesterMixin. Comment cleanups: - transformer_anyflow.py:704 temb output-norm comment: drop redundant 'no ndim==2 branch'. - pipeline_anyflow.py:550 denoise loop comment: '# 6. Denoising loop'. - pipeline_anyflow_far.py:684 denoise loop comment: '# 8. Denoising loop (outer over chunks, inner over timesteps).'. - pipeline_anyflow_far.py:702 drop trailing inline comment on `timesteps = scheduler.timesteps`. - scheduling_flow_map_euler_discrete.py: clearer wording on the off-schedule `r_timestep` error. Custom schedule support: - FlowMapEulerDiscreteScheduler.set_timesteps gains `sigmas` and `timesteps` kwargs mirroring FlowMatchEulerDiscreteScheduler. Default behaviour is unchanged (linspace + shift); the validation + length-N → length-N+1 terminal-0 append are shared with the default path so on-schedule rollouts stay bit-exact. - AnyFlowPipeline.__call__ and AnyFlowFARPipeline.__call__ accept `sigmas` and `timesteps` kwargs, override num_inference_steps from their length, and forward to set_timesteps (matches LTX2Pipeline pattern). - New scheduler tests: test_set_timesteps_custom_sigmas and test_set_timesteps_custom_timesteps cover both override paths. Dtype skip on save/load: - TestAnyFlowTransformer3D and TestAnyFlowFARTransformer3D now skip test_from_save_pretrained_dtype_inference (parametrized over fp16/bf16), mirroring WanTransformer3DModel's skip — the test's tolerance requirements are too high for meaningful signal under AnyFlow's flow-map mixed-precision sampling.
|
Third pass addressed in
Test sweep: 88 passed / 83 skipped / 0 failed on CPU. Ready for another look. |
CI doc-builder style check flagged 3 files with docstring lines >119 chars. Ran 'doc-builder style src/diffusers docs/source --max_len 119' to autoformat; content unchanged, line wrapping only.
|
Hi @claude, can you do a final review of this PR? |
|
Claude finished @dg845's task in 2m 19s —— View job Final Review of AnyFlow PR
Hi @dg845 — here's my final review. The PR is in good shape after three review rounds. Most of the structural and API items from your earlier reviews have been cleanly addressed. I found a few remaining items — one is a pending suggestion from you that hasn't been applied yet, and the rest are minor. Blocking1. Unaddressed review suggestion: Your most recent suggestion on # Current:
full_sigmas = torch.cat([working_sigmas, torch.zeros(1, dtype=working_sigmas.dtype)])
# Should be:
full_sigmas = torch.cat([working_sigmas, working_sigmas.new_zeros(1)])You noted this prevents test failures on CUDA in Non-blocking2. In 3. Both # Copied from diffusers.pipelines.anyflow.pipeline_anyflow.AnyFlowPipeline.encode_video4. Similarly, 5. FAR pipeline's Per pipelines.md gotcha #6: "If a method is only used by another method, make it private". 6. Minor: In Suggestions / additional info: dead code traceI traced the call paths from both pipelines through the transformers to identify likely dead code. Bidirectional pipeline →
FAR pipeline →
All of these are fine as advisory — they're either training paths or extension points for future I2V support. No action needed. SummaryThe PR is well-structured and the code quality is high. The transformer split is clean, the |
…anup dg845 blocking suggestion (r3287274209): - scheduling_flow_map_euler_discrete.py:185 — use `working_sigmas.new_zeros(1)` instead of `torch.zeros(1, dtype=...)` so the appended terminal sigma inherits both device and dtype from working_sigmas. The current working_sigmas always starts on CPU so the device mismatch is latent, but new_zeros is the correct defensive pattern and matches how the published FAR test fixtures run on CUDA. Claude bot final-review follow-ups: - transformer_anyflow_far.py: drop three stale `# step 3: generate attention mask` comments left over from the original numbered-step structure (bot huggingface#6). - pipeline_anyflow_far.py: annotate `encode_video` with `# Copied from diffusers.pipelines.anyflow.pipeline_anyflow.AnyFlowPipeline.encode_video` and align docstring + inline comment so `make fix-copies` keeps them in sync (bot huggingface#3). Skipped (not real / judgment-call): - bot huggingface#2 (private rename of `_forward_far_patchify*`) — already done in 84605d5; bot was looking at a stale snapshot. - bot huggingface#4 (check_inputs `# Copied from`) — FAR's check_inputs has an extra `(num_frames - 1) % 4 == 0` constraint that doesn't map onto the bidi version, so a clean `# Copied from` link would require restructuring. Bot called it a consistency nit; leaving as-is. - bot huggingface#5 (`encode_kv_cache` → `_encode_kv_cache`) — bot itself flagged this as judgment-call territory; the helper is a coherent operation that advanced inference callers may want to invoke directly.
dg845
left a comment
There was a problem hiding this comment.
Thanks for your hard work on this PR!
|
Thanks so much for the careful, patient guidance across all four review rounds, @dg845 — really appreciate the time you put in. Excited to see AnyFlow land! 🚀 |
|
Merging as the CI failures are unrelated. |
VideoProcessor.preprocess_video's 5D contract is (B, T, C, H, W) — the diffusers AnyFlow PR aligned its docstring + EXAMPLE_DOC_STRING with this in the third review pass (huggingface/diffusers#13745, commits ffdc969 and downstream). This README's I2V example still showed (B, C, T, H, W) and the matching unsqueeze(2); update both so users following the README verbatim get a tensor the diffusers pipeline accepts.
…Flow classes The training pipeline (far/main.py:save_checkpoint) emits .pt files keyed by 'ema' / 'model_state_dict_g'; the diffusers pipelines load from a structured directory written by pipeline.save_pretrained(). Until now this conversion script wrapped the .pt into a pipeline built from this repository's WanAnyFlowPipeline / FARWanAnyFlowPipeline / FAR_Wan_Transformer3DModel / FlowMapDiscreteScheduler — so the resulting model_index.json referenced far.* paths that diffusers.from_pretrained couldn't resolve. Switch the conversion to the diffusers AnyFlow classes (introduced in huggingface/diffusers#13745): - AnyFlowTransformer3DModel (bidirectional T2V variants) - AnyFlowFARTransformer3DModel (FAR causal variants) - AnyFlowPipeline / AnyFlowFARPipeline - FlowMapEulerDiscreteScheduler Output directories now load via AnyFlowPipeline.from_pretrained(...) with no compat shim. The CLI surface (OmegaConf model_type / model_path / model_save_dir keys + auto-append of model_type to the save dir) is preserved. Tensor keys are unchanged across FAR_Wan_Transformer3DModel and the diffusers AnyFlow classes (bit-exact L2=0 against the released NVlabs checkpoints), so load_state_dict(strict=False) handles the EMA bookkeeping fields without dropping any real weights.
…m AnyFlow merge Following huggingface/diffusers#13745 (AnyFlow merged into diffusers >= 0.36), align the in-repo classes and their call sites with the upstream API contract. Scheduler — FlowMapDiscreteScheduler: - `step()` returns `FlowMapEulerDiscreteSchedulerOutput(prev_sample=...)` instead of a bare tensor, with the diffusers-standard arg order `(model_output, timestep, sample, ...)` and a `return_dict=True` tuple fallback. - `set_timesteps()` drops the trailing `0` from `linspace(1.0, 0.0, n+1)`. Model — FAR_Wan_Transformer3DModel: - Remove the `init_far_model` / `init_flowmap_model` boolean config flags. Setup is now inferred from whether `compressed_patch_size` / `deltatime_type` are provided — matches the upstream `AnyFlow*Transformer3DModel` shape. - `setup_flowmap_model()` reads `gate_value` / `deltatime_type` from `self.config` (no per-call args). - `setup_far_model()` rebuilds the rotary embedding with the compressed patch size (fixes the rope / patch-size mismatch when FAR is enabled). Call-site adapters (no logic change): - WanAnyFlowPipeline / FARWanAnyFlowPipeline: use the new `step()` signature and `.prev_sample`; append a trailing `0` to `timesteps` in the inference loop to compensate for the drop in `set_timesteps()`. - On-policy trainers: three DMD-scheduler call sites updated to the new arg order. - Pretrain trainers: infer setup from the presence of `compressed_patch_size` / `deltatime_type` instead of the removed boolean flags. - demo.py and scripts/convert_model/convert_anyflow_to_diffusers.py: same config-shape migration.
…up (#13792) * [AnyFlow] FAR: standalone causal-mask builder + torch.compile follow-up Follow-up to #13745. Extracts FAR mask construction to a module-level helper and adds an `attention_mask` forward kwarg so AnyFlowFARTransformer3DModel can be wrapped in `torch.compile(fullgraph=True)`. The pipeline pre-builds the mask during KV-cache prefill so users get end-to-end fullgraph compile. * Public method `AnyFlowFARTransformer3DModel.build_attention_mask(...)` (modes: "train", "cache") plus private module-level helper `_build_anyflow_far_causal_block_mask(...)`. * `_build_freqs` cache lookup/write bypassed under `torch.compiler.is_compiling()` to avoid a Dynamo guard recompile on the second compiled call (applied in bidi source; synced to FAR via `# Copied from`). * `TestAnyFlowFARTransformer3DCompile(TorchCompileTesterMixin)` — recompilation_and_graph_break, repeated_blocks, and group_offloading pass on H200; AOT is `@pytest.mark.skip`'d (torch.export rejects BlockMask as a pytree input). * Base `get_dummy_inputs` omits `attention_mask` so every non-compile test class exercises the in-forward fallback; the compile class overrides to inject a pre-built mask. * Bit-exact: pre-built path vs internal-build fallback max|Δ|=0.0e+00. * [AnyFlow] docs: full author list, repo demo examples, slimmer pipeline page * Full author list and NVIDIA → NUS → MIT institution order; TL;DR + abstract + Available Models bullets. * Rewritten pipeline-selection tip describing both pipelines symmetrically. * T2V / I2V / V2V examples now use the canonical 81-frame setup and the demo prompts / conditioning assets shipped under `NVlabs/AnyFlow/assets/evaluation/` (linked via raw.githubusercontent.com). * Drop the inline "Optimizing Memory" and "torch.compile" sections — those notes will live in the NVlabs/AnyFlow repo's own performance guide rather than the diffusers pipeline reference. * Sync zh user guide and the two model-API stubs. * [AnyFlow] FAR: move chunk_partition default into transformer config - AnyFlowFARTransformer3DModel.__init__ now accepts chunk_partition via @register_to_config (default (1, 3, 3, 3, 3, 3, 3, 2) for the released 81-frame checkpoints, matching the field on Hub). - AnyFlowFARPipeline.__call__ no longer requires chunk_partition; defaults to self.transformer.config.chunk_partition. Per-call override still supported for V2V / non-default num_frames. - Drop the AnyFlowFARPipeline.default_chunk_partition class attribute. - Update docs (en pipelines/models, zh using-diffusers) and the conversion script to match. * [AnyFlow] FAR pipeline: fix `timesteps` shadowing across chunks Inside the per-chunk rollout loop, the local variable `timesteps` was reassigned to `self.scheduler.timesteps` after `set_timesteps()`. On the next chunk iteration the same name was passed back into `set_timesteps(timesteps=...)`, where a non-None value enters the *custom-schedule* branch — `apply_shift` re-runs on already-shifted values, double-shifting the schedule for every chunk after the first. Concretely, with `shift=5` and `num_inference_steps=4`: - chunk 0 timesteps: [1000, 937.5, 833.3, 625] (correct) - chunk 1+ timesteps: [1000, 986.8, 961.3, 892.9] (double-shifted) The later steps drift toward `t=1000` instead of toward `t=0`, the flow-map model is conditioned on the wrong source sigma, and the chunk KV cache accumulates errors that show up as artifacts in later video frames. Fix: rebind the cached schedule to a fresh local name (`scheduler_timesteps`) so the outer-scope `timesteps` kwarg (the user-provided custom schedule, when any) stays untouched across chunks. Layer-by-layer verification against the NVlabs reference implementation on H200 (elephant prompt, seed 0, 4 NFE, 81 frames): - chunk 0 inference: bit-exact (0.0 mean diff) - chunk 1 step 0: 0.194 → 0.014 (-93%) - chunk 7 last step: 0.564 → 0.274 (-51%) * [AnyFlow] FAR: doc-builder line wrap for chunk_partition docstrings Pure rewrap to satisfy `doc-builder style --max_len 119`. Two docstrings introduced in 96077b2 (the `chunk_partition` config arg on the FAR transformer + the matching pipeline kwarg) wrapped a few characters short of the line budget. No semantic change. * [AnyFlow] docs: drop author names from docstrings, link FAR via HF papers, say chunk-wise - Remove author-name attributions from the transformer / pipeline class docstrings and file-header comments; the paper-citation header on the doc page keeps the full author list, the in-code references just point at the [AnyFlow] / [FAR] papers. - Link FAR via its Hugging Face papers page (https://huggingface.co/papers/2503.19325) instead of a raw arxiv.org URL, matching the AnyFlow reference style and the rest of the diffusers docs. - Describe AnyFlow FAR generation as "chunk-wise autoregressive": the pipeline autoregresses over chunks (`chunk_partition`), not single frames. * [AnyFlow] FAR: address review nits - pipeline: reuse the standard `timesteps` variable name for the per-chunk scheduler timesteps; freeze the caller-provided custom schedule in `custom_timesteps`/`custom_sigmas` before the loop so it isn't re-fed into `set_timesteps` and double-shifted on later chunks. - transformer: clarify the no-mask fallback comment to spell out the `torch.compile(fullgraph=True)` graph-break behavior and the `build_attention_mask` workaround. --------- Co-authored-by: dg845 <58458699+dg845@users.noreply.github.com>
…ausal) (#13745) * [Pipelines] AnyFlow: scaffold pipelines/anyflow + register all top-level imports This is the lazy-loader scaffolding only. Body files (pipeline_anyflow.py, pipeline_anyflow_causal.py, transformer_anyflow.py, scheduling_flow_map_euler_discrete.py) come in subsequent commits. * [Schedulers] AnyFlow: add FlowMapEulerDiscreteScheduler The flow-map scheduler advances samples from timestep t to caller-provided target r in a single Euler step, supporting any-step sampling on flow-map- distilled checkpoints. It is a general-purpose scheduler — not specific to the AnyFlow checkpoints. Tests: 12 standalone tests covering instantiation, set_timesteps endpoints, shift identity/monotonicity, step shape preservation, zero-interval identity, one-shot sampling, train weight schemes, scale_noise endpoints. Docs: api/schedulers/flow_map_euler_discrete.md * [Models] AnyFlow: add AnyFlowTransformer3DModel A 3D DiT extending the v0.35.1 Wan2.1 backbone with two config-toggled modules: * FAR causal blocks (init_far_model=True): block-sparse causal attention via flex_attention + compressed-frame patch embedding for frame-level autoregressive generation (Gu et al., 2025, arXiv:2503.19325). * Dual-timestep flow-map embedding (init_flowmap_model=True): adds a delta timestep embedder enabling flow-map sampling z_t -> z_r over arbitrary intervals (AnyFlow). With both flags off, the model reduces to stock Wan2.1. The class is intentionally self-contained rather than annotated with '# Copied from diffusers.models.transformers.transformer_wan' because upstream Wan has been refactored extensively since v0.35.1 (new WanAttention class, different processor architecture). Tests: 9 unit tests covering construction in 3 modes, bidi forward shape and determinism, return_dict variants, save/load round-trip with and without init_far_model, gradient checkpointing toggle. Docs: api/models/anyflow_transformer3d.md * [Pipelines] AnyFlow: add AnyFlowPipeline and AnyFlowCausalPipeline * AnyFlowPipeline (pipeline_anyflow.py, ~590 LOC): bidirectional T2V using flow-map sampling. Loads checkpoints from nvidia/AnyFlow-Wan2.1-T2V-{1.3B,14B}. * AnyFlowCausalPipeline (pipeline_anyflow_causal.py, ~700 LOC): FAR-based causal pipeline supporting T2V/I2V/TV2V via task_type kwarg. Loads checkpoints from nvidia/AnyFlow-FAR-Wan2.1-{1.3B,14B}-Diffusers. Both pipelines reuse stock WanLoraLoaderMixin, AutoencoderKLWan, UMT5EncoderModel, and AutoTokenizer from upstream. The transformer is the AnyFlowTransformer3DModel introduced in the previous commit. The scheduler is FlowMapEulerDiscreteScheduler. Tests: * tests/pipelines/anyflow/test_anyflow.py: PipelineTesterMixin fast tests + slow integration test against nvidia/AnyFlow-Wan2.1-T2V-1.3B-Diffusers. * tests/pipelines/anyflow/test_anyflow_causal.py: same structure for FAR variant. Reference slices for slow integration tests are deferred to Phase 7 (Final quality pass) where the user runs them on a real GPU. * [Docs] AnyFlow: add main pipeline documentation page Modeled on the Helios pipeline doc (PR #13208). Sections: paper link + abstract, supported checkpoints table, memory/speed optimization tabs, T2V/I2V/TV2V examples for both bidirectional and causal variants, autodoc trailers. * [Auto/Scripts] AnyFlow: register AutoPipelineForText2Video + add conversion script * Register AnyFlowPipeline in AUTO_TEXT2VIDEO_PIPELINES_MAPPING. * AnyFlowCausalPipeline is intentionally NOT registered for AutoPipeline because its task switch (t2v / i2v / tv2v) is too rich for a single auto-resolve key. * scripts/convert_anyflow_to_diffusers.py: convert .pt training checkpoints (with 'ema' state dict) into a diffusers save_pretrained layout. Supports all 4 released NVIDIA AnyFlow variants. Replaces the omegaconf-based config in the upstream repo with argparse to match other diffusers conversion scripts. * [Quality] AnyFlow: ruff-format + regenerated dummy stubs * ruff format pass on all 5 source files (long lines + trailing comma fixes) * check_dummies.py --fix_and_overwrite regenerated: - dummy_pt_objects.py: AnyFlowTransformer3DModel + FlowMapEulerDiscreteScheduler - dummy_torch_and_transformers_objects.py: AnyFlowPipeline + AnyFlowCausalPipeline Local fast tests: 21/21 passed - 12 scheduler tests (FlowMapEulerDiscreteScheduler) - 9 transformer tests (AnyFlowTransformer3DModel construction + bidi forward + save/load) The pipeline fast tests in tests/pipelines/anyflow/ require a local dev install that matches the diffusers main branch's transformers >= compatibility floor. The reference slices for slow integration tests (real GPU + 1.3B/14B checkpoints) are intentionally left as TODO stubs to be captured by the user on a real GPU machine before opening the PR. * [AnyFlow] address review feedback: bug fixes + DMD wording + EN/ZH tutorials Critical bug fixes (verified against precision-validation review): * pipeline_anyflow.py / pipeline_anyflow_causal.py: replace hardcoded transformer_dtype = torch.bfloat16 with self.transformer.dtype, so pipe.to("cpu") and PipelineTesterMixin save/load tests do not crash on a dtype mismatch in the patch_embedding conv3d. * transformer_anyflow.py: drop the duplicate `base = base = ...` assignment in _build_causal_mask (was a copy-paste typo carried over from FAR-Dev). * transformer_anyflow.py: drop unused `q_is_context` / `k_is_context` locals and the `# noqa: F841` markers that were silencing the dead-store warning. * transformer_anyflow.py: remove `CacheMixin` from the inheritance list — the pipeline manages KV cache directly, the mixin's interface is unused. * transformer_anyflow.py: guard the module-level `torch.compile(flex_attention)` with try/except so the file imports cleanly on CPU CI / no-Triton machines. * convert_anyflow_to_diffusers.py: replace ad-hoc print warnings with the stdlib logger (warning_once-style) and a module-level basicConfig. Documentation accuracy: * AnyFlowCausalPipeline class docstring + main pipeline doc + EN/ZH tutorial: drop the fictitious `task_type` / `image` / `video` arguments and document the real API: pass `context_sequence={"raw": tensor}` (or `{"latent": ...}`) to switch between T2V (None) / I2V (1-frame) / TV2V (4n+1-frame) modes. * Pipeline class docstrings + main doc: explicitly describe AnyFlow's two-stage LoRA distillation including DMD reverse-divergence supervision with Flow-Map backward simulation in stage 2 (was previously implicit). * training_rollout: add detailed docstring explaining its role as the 3-segment Flow-Map backward simulation entry point used during DMD training. * Long-form tutorial doc `using-diffusers/anyflow.md` (EN, 239 LOC) and Chinese mirror `docs/source/zh/using-diffusers/anyflow.md` (224 LOC) added and registered in both `_toctree.yml` files. Tests: * Skip `test_attention_slicing_forward_pass` in both pipeline test classes with a clear rationale (custom attention processor does not support slicing). * All 21 standalone tests still pass (12 scheduler + 9 transformer). Quality gates: * `ruff check` clean across all AnyFlow files. * `ruff format --check` reports 6 files already formatted. * `python utils/check_copies.py` reports no diff. Out of scope for this commit (deferred until reviewer feedback): * Splitting AnyFlowTransformer3DModel into bidi + causal subclasses * Unifying _forward_inference / _forward_cache return types * Migrating model tests from plain unittest to BaseModelTesterConfig + mixins * HF model card / config.json metadata updates on the nvidia/* repos (push to Hub manually before opening the PR) * [AnyFlow] rename Causal->FAR + explicit forward signature + dataclass output Round 2 of review feedback. Three groups of changes; transformer state-dict keys, module hierarchy, and tensor flow are unchanged so the H200 bit-exact validation remains valid. A. Pipeline rename (mechanical, no behavior change): * Class: AnyFlowCausalPipeline -> AnyFlowFARPipeline (Causal in diffusers usually means an attention mask; AnyFlow's variant is FAR autoregressive, so the FAR name is more specific and matches the paper). * File: pipeline_anyflow_causal.py -> pipeline_anyflow_far.py (git mv). * Test file: test_anyflow_causal.py -> test_anyflow_far.py (git mv). * All references updated in src/, tests/, docs/, scripts/, plus stale anyflowcausalpipeline anchor links in tutorial markdown. B. Pipeline test bug fixes (closes 19 fast-test failures reported by precision-validation reviewer): * pipeline_anyflow.py / pipeline_anyflow_far.py: __call__ now sets self._num_timesteps = num_inference_steps before the rollout, so the PipelineTesterMixin callback tests can read pipe.num_timesteps. * tests/pipelines/anyflow/test_anyflow_far.py: drop the fictitious task_type="t2v" kwarg that crashed every causal fast test (the FAR pipeline selects mode via context_sequence, not a task_type arg). C. Transformer architecture cleanups (review-driven, no tensor changes): * Replace forward(*args, **kwargs) dispatcher with an explicit signature listing every supported kwarg (hidden_states, timestep, r_timestep, encoder_hidden_states, encoder_hidden_states_image, chunk_partition, clean_hidden_states, clean_timestep, kv_cache, kv_cache_flag, is_causal, attention_kwargs, return_dict). Helps IDE / type-checker / torch.compile tracing. * Drop SimpleNamespace returns. Add AnyFlowFARTransformerOutput (BaseOutput dataclass with sample + kv_cache fields) for the two causal paths that need to also propagate kv_cache (_forward_inference and the newly return_dict-aware _forward_cache). _forward_train and _forward_bidirection now consistently return Transformer2DModelOutput. Pipeline call sites already use return_dict=False with positional unpacking, so the fix is transparent there. Out of scope (deferred until canonical-org HF metadata sync): * Splitting AnyFlowTransformer3DModel into a bidi class plus an AnyFlowFARTransformer3DModel subclass — touches register_to_config keys and would require updating model_index.json on every released checkpoint. * Promoting chunk_partition from register_to_config to a forward-time argument (same reason). * Renaming training_rollout to _denoise — would break callers in the FAR-Dev on-policy trainer that produced the released checkpoints. Local fast tests: 21/21 still pass (12 scheduler + 9 transformer). ruff check, ruff format, and check_copies.py are all clean. * [AnyFlow] wire callback_on_step_end through inference_range + add chunk_partition to FAR fast-test fixture Two root causes for the 19 remaining PipelineTesterMixin failures, identified by the H200 reviewer: 1. callback_on_step_end was accepted by __call__ but never invoked. Both pipelines pass it through to training_rollout (and FAR additionally through inference()), and inference_range now fires it after scheduler.step in the standard inference branch: if callback_on_step_end is not None: callback_kwargs = {k: locals()[k] for k in callback_on_step_end_tensor_inputs} callback_outputs = callback_on_step_end(self, i, t, callback_kwargs) latents = callback_outputs.pop("latents", latents) prompt_embeds = ... negative_prompt_embeds = ... `nonlocal prompt_embeds, negative_prompt_embeds` lets the callback rewrite the closure-captured embeddings, matching upstream WanPipeline semantics. The 3-segment grad_timestep training rollout does not invoke the callback; it is intentionally training-only. 2. tests/pipelines/anyflow/test_anyflow_far.py::get_dummy_components built the dummy transformer without a `chunk_partition`, leaving it None on the model config and crashing the pipeline at `sum(self.transformer.config.chunk_partition)`. Set `chunk_partition=[1, 1, 1]` in the fixture (3 chunks of 1 latent frame each, matching the test's num_frames=9 -> 3 latent frames). Local fast tests: 21/21 still pass. ruff check, ruff format, and check_copies.py are all clean. * [AnyFlow] Phase 2: split transformer + drop chunk_partition from config + rename helpers Major architectural refactor that aligns the integration with diffusers conventions ahead of the canonical-org Hub upload. State-dict keys, module hierarchy, and tensor flow are unchanged so the H200 bit-exact validation remains valid; only the on-disk transformer/config.json fields move. Changes: 1. **Sibling transformer classes** replace the flag-driven single class: * AnyFlowTransformer3DModel — bidirectional only. Drops compressed_patch_size / full_chunk_limit / init_far_model / init_flowmap_model / chunk_partition kwargs (always-on for AnyFlow distilled checkpoints). * AnyFlowFARTransformer3DModel — adds far_patch_embedding + the 3 FAR forward paths (train / cache-prefill / autoregressive inference). * AnyFlowTimeTextImageEmbedding (the legacy single-time embedder used only by the old setup_flowmap_model bootstrap) is removed; both classes now build AnyFlowDualTimestepTextImageEmbedding directly in __init__. * setup_flowmap_model / setup_far_model methods are removed; weight warm-start for far_patch_embedding (trilinear interpolation from patch_embedding) moves into AnyFlowFARTransformer3DModel.__init__. 2. **chunk_partition** is no longer a model config field. The FAR pipeline owns the schedule: * AnyFlowFARPipeline.default_chunk_partition = [1, 3, 3, 3, 3, 3, 3, 2] matches the released 81-frame NVIDIA checkpoints. * AnyFlowFARPipeline.__call__ / _denoise_rollout accept a chunk_partition argument that overrides the default for non-default num_frames. 3. **training_rollout -> _denoise_rollout** rename across both pipelines and all English / Chinese docs that referenced it. Signals the method is internal to the pipeline driver, not a public training API. 4. **Conversion script + tests + docs + registries**: * scripts/convert_anyflow_to_diffusers.py: VARIANTS dict picks the right transformer class per variant; init_far_model / init_flowmap_model / chunk_partition kwargs are removed from the from_pretrained call. * Transformer test file split into AnyFlowTransformer3DModelTest and AnyFlowFARTransformer3DModelTest classes. * Pipeline test fixtures use the right class and pass chunk_partition via get_dummy_inputs (3-frame schedule [1, 1, 1] for the 9-frame test). * New docs page docs/source/en/api/models/anyflow_far_transformer3d.md; anyflow_transformer3d.md rewritten for the bidi-only class. * AnyFlowFARTransformer3DModel registered in src/diffusers/__init__.py, src/diffusers/models/__init__.py, models/transformers/__init__.py and the dummy_pt_objects.py stubs. * docs/source/en/_toctree.yml: new entry for the FAR transformer page. 5. **Cleanups**: * Pipeline __call__ no longer passes is_causal=False to the bidi forward (the bidi class doesn't accept it). * Pipeline class docstrings drop stale references to init_*_model flags. Local tests: 22/22 pass (12 scheduler + 10 transformer covering both classes). ruff check / format / check_copies clean. Hub artifacts (model_index.json, transformer/config.json, scheduler config) need to be regenerated for the released checkpoints; the HF update guide will be delivered separately. * [AnyFlow] Phase 3: convention compliance against .ai/AGENTS.md + .ai/models.md Hard violations (per official diffusers guidelines): * drop einops dependency — replace 25+ rearrange() calls with native permute/reshape/unflatten in transformer + both pipelines * device-gate torch.float64 — apply_rotary_emb and AnyFlowRotaryPosEmbed now fall back to float32 / complex64 on MPS / NPU; freqs are lazily rebuilt per-device via _build_freqs (matches transformer_wan / transformer_flux pattern) * migrate attention to dispatch_attention_fn — replace direct F.scaled_dot_product_attention calls with dispatch_attention_fn (works with sage / flash / native backends); introduce AnyFlowAttention( AttentionModuleMixin) with _default_processor_cls / _available_processors; rename processors to AnyFlowAttnProcessor / AnyFlowCrossAttnProcessor and declare _attention_backend / _parallel_config class attrs * drop dead config fields — qk_norm and added_kv_proj_dim are pruned from both transformer __init__ signatures and AnyFlowTransformerBlock; AnyFlowAttention is hardcoded to rms-norm-across-heads (the only scheme the released checkpoints use) and has no add_k_proj path (T2V only) * add _repeated_blocks = ["AnyFlowTransformerBlock"] to both transformer classes for compile_repeated_blocks() support (matches Wan) * annotate prepare_latents with `# Copied from diffusers.pipelines.wan. pipeline_wan.WanPipeline.prepare_latents`; the pipeline-side rearrange to (B, T, C, H, W) layout is moved to the call site State-dict keys are preserved (legacy Attention had identical to_q / to_k / to_v / to_out / norm_q / norm_k naming), so existing AnyFlow checkpoints load bit-exactly into the new AnyFlowAttention class. The HF Hub config-update guide is updated correspondingly: transformer/ config.json now drops qk_norm and added_kv_proj_dim alongside the previous init_far_model / init_flowmap_model / chunk_partition removals. 22 fast CPU tests still pass; ruff format / ruff check / check_copies all clean. * [AnyFlow] FAR fast-test compat: rope 0-dim guard + flex_attention CPU/head-dim fallbacks + KV-cache dtype + num_timesteps Phase 3 migrated bidi + cross-attention to dispatch_attention_fn but the FAR causal path still calls flex_attention directly, which has hard requirements (CPU compile, head_dim >= 16) that fail on PipelineTesterMixin's tiny dummy components. Real ckpts (head_dim=128, CUDA) never hit these branches; bit-exact numerical equivalence with FAR-Dev preserved on all 4 released ckpts (forward 0.00e+00, backward kernel-nondet only, ratio 1.000). Code fixes: 1. AnyFlowRotaryPosEmbed._forward_compressed_frame / _forward_full_frame now short-circuit to an empty tensor when num_frames / height / width is 0. PipelineTesterMixin's dummy VAE has scale_factor_spatial=8, so a 16x16 raw spatial input becomes a 2x2 latent which then floors to 0 against compressed_patch_size=(1, 4, 4); the original `freqs[:0].view(0, k, 1, -1)` reshape was ambiguous in that regime. 2. flex_attention dispatch: split the module-load `torch.compile(flex_attention, dynamic=True)` into `_flex_attention_eager` (always available) plus `_flex_attention_compiled`, with a tiny wrapper that picks compiled for CUDA tensors and eager for CPU. Avoids torch._inductor C++ codegen failures that broke fast tests after `pipe.to("cpu")`. CUDA performance unchanged (L10 benchmark: 0.0% delta on bidi 1.3B fwd, 0.0% delta on FAR causal 1.3B fwd). 3. AnyFlowAttnProcessor (FAR causal branch): when head_dim < 16 (flex_attention's hard minimum) zero-pad q/k/v's last dim to 16 and pass `scale=1/sqrt(original_head_dim)` to flex_attention. Padded value rows contribute 0, so trimming the output back is mathematically equivalent. Released ckpts use head_dim=128 so the branch is never taken in production. 4. pipeline_anyflow_far.encode_kv_cache: replace the hardcoded `latents.to(torch.bfloat16)` with `self.transformer.dtype`. The hardcoded bf16 crashed conv3d on dummy fp32 components ("Input type (BFloat16) and bias type (float) should be the same"); real bf16 ckpts are unaffected. 5. pipeline_anyflow_far._denoise_rollout sets `self._num_timesteps = (len(chunk_partition) - num_context_chunks) * num_inference_steps` before the chunk loop, so PipelineTesterMixin.test_callback_cfg's `pipe.num_timesteps`-based assertion matches the actual number of callback fires (chunks * NFE) instead of the previous hardcoded num_inference_steps. Tests: * test_callback_inputs cannot pass without changing FAR's chunk-wise output semantics — it zeroes latents on the final step and asserts the *entire* output buffer is zero, but only the active chunk's slice is overwritten in a chunk-wise rollout. Marked `@unittest.skip` with a detailed rationale; callback functionality itself is still covered by test_callback_cfg. * Full pytest run on tests/pipelines/anyflow/ + tests/models/transformers/test_models_transformer_anyflow.py + tests/schedulers/test_scheduler_flow_map_euler_discrete.py: 81 passed, 0 failed, 11 skipped. Quality gates: * `ruff check` and `ruff format --check` clean across all AnyFlow files. * `python utils/check_copies.py` clean. * `python utils/check_dummies.py` clean. * [AnyFlow] docs/code: paper-release tidy-up User-facing alignment with the official HF Hub model card and the day-of-announcement materials at https://huggingface.co/collections/nvidia/anyflow. * Fill in the arXiv identifier 2605.13724 (5 paper links + 2 BibTeX entries). * Rename TV2V → V2V across docs + pipeline_anyflow{,_far}.py so the diffusers copy uses the same Video-to-Video terminology as the official model card. * Add the [nvidia/anyflow](https://huggingface.co/collections/nvidia/anyflow) HF collection link to the three tutorial intros. * Drop the temporary "guyuchao/* staging" tip from the EN tutorial / API page / ZH tutorial — the nvidia/AnyFlow-*-Diffusers repos are now live. * Wire up NVlabs/AnyFlow (training code) and nvlabs.github.io/AnyFlow (project page) in place of the prior <github-org> / <project-page-url> placeholders. * Cite the authors (Yuchao Gu, Guian Fang et al.) and NUS ShowLab × NVIDIA affiliation in the main tutorial, API pipeline page, and both transformer model pages; BibTeX uses the standard `and others` to elide the full list until the next pass. Working tree, CI gates, and tests after the change: ruff format --check ✓ ruff check ✓ python utils/check_copies.py ✓ python utils/check_dummies.py ✓ pytest tests/models + tests/schedulers (22 fast) ✓ No production code logic changes — only docstring wording inside pipeline files (TV2V → V2V). * [AnyFlow] docs: drop in official BibTeX (full author list) Replace the placeholder ``@article{gu2026anyflow, author = {Gu, Yuchao and Fang, Guian and others}, ...}`` block in both the English and Chinese tutorials with the canonical ``@misc{gu2026anyflowanystepvideodiffusion, ...}`` form from arxiv.org/abs/2605.13724, which lists all seven authors: Yuchao Gu, Guian Fang, Yuxin Jiang, Weijia Mao, Song Han, Han Cai, Mike Zheng Shou. Docs-only. * [AnyFlow] align with diffusers conventions + drop training-only code Scheduler - FlowMapEulerDiscreteScheduler.step now returns a FlowMapEulerDiscreteSchedulerOutput dataclass (or tuple with return_dict=False) and uses the conventional positional order (model_output, timestep, sample, r_timestep). - Drop training-only helpers: adaptive_weighting, set_train_weight, get_train_weight, linear_timesteps_weights, and the weight_type config field. - Add scale_model_input no-op for API parity; raise ValueError on missing r_timestep. Transformer - Remove gate_track debug write inside AnyFlowDualTimestepTextImageEmbedding.forward_timestep. - Compile flex_attention lazily on first CUDA call instead of at import time. - Replace assert with ValueError in build_block_mask. - Resolve <arxiv-id> placeholders to 2605.13724. Pipelines (AnyFlowPipeline + AnyFlowFARPipeline) - Add EXAMPLE_DOC_STRING + @replace_example_docstring and full __call__ docstrings covering every argument. - Move use_mean_velocity from __init__ to __call__ so save/load round-trips. - Drop _denoise_rollout's grad_timestep branch (DMD on-policy training rollout), the inner inference_range closure, and the redundant negative-prompt concat. - Replace asserts with ValueError; wire show_progress to tqdm; rename inference -> _inference; remove dead current_timestep property. - Update scheduler.step call sites to the new signature. - Trim class docstrings to inference-only language. Pipeline output - Add Apache 2.0 license header; switch to relative import. Auto pipeline / conversion script - Register AnyFlowFARPipeline in AUTO_IMAGE2VIDEO_PIPELINES_MAPPING and AUTO_VIDEO2VIDEO_PIPELINES_MAPPING. - Document the weights_only=False requirement in the conversion script. Tests - Scheduler tests use the new step signature and verify the Output dataclass contract. - Drop the four obsolete training-weight tests; drop weight_type kwarg from pipeline test fixtures; remove internal milestone names from TODO comments. Docs - Resolve <arxiv-id> in the scheduler docs page. - Trim DMD / on-policy distillation language in EN/ZH tutorials and the pipelines page; the paper abstract quote is preserved verbatim. * [AnyFlow] split FAR causal transformer into transformer_anyflow_far.py Per @dg845's review on #13745: extract FAR causal modules into a dedicated sibling file so each transformer variant reads in isolation. Shared submodules are duplicated via `# Copied from` so `make fix-copies` keeps both in sync. - `transformer_anyflow.py`: bidi-only. `AnyFlowAttnProcessor` no longer carries the flex/KV-cache branch (was: dispatch in one branch, bare flex_attention in the other); `AnyFlowRotaryPosEmbed` drops the compressed-frame helpers and the `is_causal` arg; `AnyFlowDualTimestepTextImageEmbedding` drops its causal branch. `AnyFlowTransformerBlock` keeps a single class with a new `is_causal: bool = False` ctor flag that selects the self-attn processor — the forward path is identical in both modes, only the processor differs. - `transformer_anyflow_far.py`: new. Contains `AnyFlowFARTransformerOutput`, `AnyFlowCausalAttnProcessor` (routed through `dispatch_attention_fn(backend= "flex")` with a clear ValueError when a non-flex backend is configured; the BlockMask is consumed only by the flex backend in `_native_flex_attention`), `AnyFlowDualTimestepTextImageEmbeddingCausal`, `AnyFlowCausalRotaryPosEmbed`, `AnyFlowFARTransformer3DModel`, and `# Copied from` clones of the shared shared `AnyFlowAttention`/`AnyFlowCrossAttnProcessor`/`AnyFlowImageEmbedding`/ `AnyFlowTransformerBlock`/`AnyFlowAttnProcessor` modules. Verified bit-exact against the pre-refactor branch on H200 (float32): - bidi: L2 = 0.000e+00, max|Δ| = 0.000e+00 - FAR : L2 = 4.772e-06, max|Δ| = 3.576e-07 The FAR delta is fp32 accumulation noise from the dispatch path permuting (B,L,H,D) ↔ (B,H,L,D) around the same `flex_attention` kernel. Addresses review comments at transformer_anyflow.py:215, :261, :450, :622, :671, :958. * [AnyFlow] pipeline cleanup: video_processor, encode_video, inline rollout, kwarg rename Per @dg845's review on #13745, applied to both bidi `AnyFlowPipeline` and causal `AnyFlowFARPipeline`: - Use `self.video_processor.preprocess_video(...)` instead of the manual `* 2 - 1` normalize. - Merge `vae_encode` + `encode_latents` + `_normalize_latents` into a single `encode_video` method, mirroring `WanImageToVideoPipeline.encode_image`'s flat structure. - Inline `_denoise_rollout` into `AnyFlowPipeline.__call__`. For the FAR pipeline, inline both `_denoise_rollout` and `_inference` as a nested loop (outer over chunks, inner over denoising steps), mirroring `WanAnimatePipeline.__call__`. `encode_kv_cache` is intentionally kept as a method — it is one transformer call with a different `kv_cache_flag` mode (cache-write), and inlining it would interleave two distinct forward semantics in the same loop body and lose readability. - Rename `context_sequence` → `video` (pixel-space) + `video_latents` (pre-encoded), matching `WanVideoToVideoPipeline`. For the FAR pipeline, the old `{"raw"/"latent"}` dict form is replaced by the two kwargs. Mutually-exclusive validation raises `ValueError`. Addresses review comments at pipeline_anyflow.py:358, :372, :393, :473 and pipeline_anyflow_far.py:395, :489, :675. * [AnyFlow] scheduler: N-length timesteps + step defaults r_timestep Per @dg845's review on #13745: - `set_timesteps(N)` now produces `N` timesteps backed by an internal `sigmas[N+1]` linspace, matching `FlowMatchEulerDiscreteScheduler.set_ timesteps`. The final sigma (== 0) is the implicit r-endpoint of the last step; the pipeline rollouts iterate `for i, t in enumerate(timesteps)` without the old `[:-1]` slicing. - `step(r_timestep=None)` now defaults to the next timestep on the schedule (resolved via fp-tolerant `argmin` over `sigmas[:-1]`), instead of raising. Any-step sampling is preserved when `r_timestep` is explicit. The raise stays only for the case where the caller passes a `timestep` value that isn't on the schedule and provides no `r_timestep` — there's no sensible default in that case. - Build sigmas in float64 on CPU then move to the target device, with a float32 downcast for MPS / NPU (float64 isn't supported on those backends). Pipeline rollout loops updated to compute `r = sigmas[i + 1] * num_train_ timesteps` for the model's `r_timestep` input and pass `r_timestep=None` to `scheduler.step` (which resolves it from the schedule internally). Addresses review comments at scheduling_flow_map_euler_discrete.py:107 and :148. * [AnyFlow] tests: regenerate via generate_model_tests.py; split bidi/FAR files Per @dg845's review on #13745: replaced the hand-rolled transformer tests with the standard mixin-based suite produced by `utils/generate_model_tests .py`, and split the FAR causal model tests into their own file to mirror the transformer file split. - `tests/models/transformers/test_models_transformer_anyflow.py`: regenerated bidi suite. Pulls in `ModelTesterMixin`, `MemoryTesterMixin`, `TrainingTesterMixin`, `AttentionTesterMixin`, `TorchCompileTesterMixin` via `BaseModelTesterConfig`, with `get_init_dict()` / `get_dummy_inputs()` filled in for the small bidi config used in CI. - `tests/models/transformers/test_models_transformer_anyflow_far.py`: new. Same mixin set (TorchCompile is intentionally skipped — FAR's `_build_causal_mask` uses `flex_attention.create_block_mask(_compile=False)` which conflicts with the standard compile tester's assumptions; the bidi file covers compile, FAR is bit-exact-validated end-to-end on H200 via the pipeline replay). Also carries an `AnyFlowCausalAttnProcessor` smoke test that exercises the backend gate (non-flex backends must raise) and asserts the `AnyFlowFARTransformerOutput` dataclass exposes the expected fields. Addresses review comments at test_models_transformer_anyflow.py:71 and :128. * [AnyFlow] docs: update for video / video_latents kwarg rename Following the pipeline kwarg refactor in e9d50b2, sweep the user-facing docs to reflect the new API: - `docs/source/en/api/pipelines/anyflow.md`: T2V / I2V / V2V code examples now use `video=` instead of `context_sequence={"raw": ...}`. The "Generation with AnyFlow (FAR Causal)" intro describes the new mutually-exclusive `video` / `video_latents` selector. - `docs/source/en/using-diffusers/anyflow.md`: the scenario selector table, the "Image-to-video and video-to-video" walkthrough, and the closing note about pre-encoded latents are all updated. `vae_encode` references are replaced with `encode_video`. * [AnyFlow] tests: skip FAR training tests on CPU (flex backward); align scheduler tests with N-length timesteps - TestAnyFlowFARTransformer3DTraining: skip test_training / test_training_with_ema / test_gradient_checkpointing_equivalence on CPU. FAR causal self-attn uses torch.nn.attention.flex_attention whose backward kernel is GPU-only. - test_scheduler_flow_map_euler_discrete: assert timesteps is N-length (not N+1) and the sigma=0 r-endpoint lives in self.sigmas[-1]; test_step_one_shot_sampling now exercises r_timestep=None (resolved from sigmas) since N=1 has no timesteps[1]. * [AnyFlow] docs: complete forward() Args: sections for check_forward_call_docstrings main #13758 added utils/check_forward_call_docstrings.py which requires every signature arg to appear as its own `name (...):` entry under Args:. Expand the bidi and FAR transformer forward docstrings to list each parameter individually. * [AnyFlow] apply 5/21 review suggestions (A: 1-click) FAR transformer: - AnyFlowCausalAttnProcessor: default _attention_backend = 'flex' (was None); remove None from _SUPPORTED_BACKENDS. None previously fell through to SDPA which silently ignored the BlockMask; failing loudly is the right default. - dispatch_attention_fn call: read self._attention_backend instead of hardcoded 'flex', so '_native_flex' selection works. - _build_freqs / _forward_full_frame: add '# Copied from' to bidi RoPE. Pipelines: - bidi + FAR docstrings: video shape (B, C, T, H, W) -> (B, T, C, H, W) to match VideoProcessor.preprocess_video. - FAR EXAMPLE_DOC_STRING: single-frame I2V tensor wrap uses unsqueeze(1) for the T axis instead of unsqueeze(2). - FAR encode_video: drop duplicated @torch.no_grad() decorator. Tests: - test_anyflow / test_anyflow_far: lift the test_save_load_optional_components skip (the test actually passes). - FAR processor smoke test: assert default backend is 'flex' (was 'None'). * [AnyFlow] apply 5/21 review suggestions (B: refactors) Pipelines: - check_inputs accepts video / video_latents and raises early on: (a) mutual exclusion (was checked late in __call__); (b) FAR's (num_frames - 1) % 4 == 0 constraint. __call__ no longer carries duplicate validation. - FAR pipeline: drop the show_progress kwarg and replace the single tqdm with nested progress bars in the LLaDA-2 pattern — outer 'Chunks' (position=0) and per-chunk inner 'Inference Steps' (position=1, leave=False) — both picking up DiffusionPipeline._progress_bar_config (so set_progress_bar_config controls them, including disable=None). Scheduler: - step() resolves source and target sigmas by indexing self.sigmas via the new index_for_timestep(), instead of dividing the input timesteps by num_train_timesteps. This keeps the math correct for any future schedule whose timestep/sigma relationship is non-linear. For an off-schedule r_timestep the code falls back to r / num_train_timesteps, so explicit any-step sampling outside the schedule still works (and t off-schedule with r=None still raises a clear ValueError, as before). Numerical equivalence: for the shipped linspace+shift schedule the two formulations are bit-identical (verified: max abs diff = 0.0 over an N=8, shift=5 schedule). * [AnyFlow] apply Claude bot review (5/21): 8 findings beyond dg845's list Finding #1 — attention_kwargs plumbing: Both transformers now decorate forward() with @apply_lora_scale('attention_kwargs') (matches Wan); pipelines forward attention_kwargs to the transformer + encode_kv_cache, and the unused parameter is dropped from the inner _forward_train / _forward_cache / _forward_inference signatures. Pipeline docstrings updated to the standard wording. Finding #2 — naming: Rename far_cfg -> layout_cfg in the bidi transformer (the bidi path is not FAR; the FAR transformer keeps far_cfg, which is accurate there). Finding #3 — scheduler state machine: Add _step_index, _begin_index, step_index property, begin_index property, set_begin_index(), _init_step_index(). step() lazily initializes and advances the counter so downstream callbacks / composable schedulers can observe rollout progress. Sigma resolution remains a pure function of (timestep, r_timestep) — calling step() twice with identical args still returns identical prev_sample (idempotent). Finding #4 — redundant @torch.no_grad(): Drop the redundant decorators on bidi pipeline's encode_video and FAR pipeline's encode_kv_cache (callers are already in __call__'s no-grad scope). Finding #5 — dead code: Remove the unreachable temb.ndim == 2 else branch from the bidi transformer's output-norm path (condition_embedder.forward always returns a 3D temb). Finding #6 — private rename: forward_far_patchify[_inference] -> _forward_far_patchify[_inference] (only called internally by _forward_train / _forward_cache / _forward_inference). Finding #7 — pipeline comment numbering: Bidi + FAR pipelines renumber steps so the # 4. slot is no longer skipped. Finding #8 — mask-mod comment numbering: _build_causal_mask numbered comments now run 1) 2) 3) ... (was 1) 3) 4) ...). Tests: - New test_step_index_advances + test_set_begin_index_anchors_step_index in the scheduler test file exercise the new state machine. - All existing pipeline / transformer / scheduler tests still pass (85 passed, 85 skipped on CPU). Bit-exact: 8-step rollout vs the previous formulation, max abs diff = 0.0 (the new sigma-lookup is byte-identical to t/num_train_timesteps on this schedule). * [AnyFlow] scheduler: honour off-schedule any-step in _init_step_index; drop dead _resolve_next_timestep Audit caught two issues in the previous scheduler commit: 1. The new state machine raised in _init_step_index whenever the first timestep wasn't on the active schedule, contradicting the documented contract that step() falls back to t/num_train_timesteps for off-schedule any-step sampling. The fall-back numerics were intact but they were unreachable — the init check fired first. Fix: _init_step_index now initializes _step_index to 0 when the timestep is off-schedule (still a valid observable counter for callbacks). step()'s sigma resolution is untouched, so on-schedule rollouts stay bit-exact and off-schedule any-step sampling actually runs again. Regression test: test_step_off_schedule_anystep_supported. 2. _resolve_next_timestep had no remaining callers after the step() rewrite inlined the same lookup. Removed (private helper, no external API). * [AnyFlow] docs: align user guides with video shape + kwarg fixes - en api/pipelines/anyflow.md: video shape (B, C, T, H, W) -> (B, T, C, H, W); example tensor wrap uses unsqueeze(0).unsqueeze(1) and permute(0, 3, 1, 2) to match VideoProcessor.preprocess_video's 5D contract. - zh using-diffusers/anyflow.md: same shape fixes; also flip the I2V / V2V examples from the obsolete context_sequence={...} dict to the current video= / video_latents= kwargs; helper to_video_tensor returns (1, T, C, H, W); add a note about mutual exclusion. * [AnyFlow] tests: drop @slow integration test scaffolds for initial PR .ai/skills/model-integration/SKILL.md is explicit: 'No integration / slow tests in the initial PR — don't add anything gated on @slow / RUN_SLOW=1 yet.' Our two integration test classes were shape-only assertions with TODOs for a future numeric reference, so dropping them loses no actual coverage — the relevant rollouts are covered by H200 bit-exact replay outside the pytest suite. Can land a follow-up PR after merge with proper numeric reference slices once the maintainer is comfortable enabling slow tests. * Apply style fixes * [AnyFlow] apply 5/22 dg845 review: comment cleanups + custom sigmas/timesteps schedule dg845 third pass — 7 of 9 comments applied; the 8th (custom sigmas/timesteps support) matches FlowMatchEulerDiscreteScheduler conventions; the 9th (_build_causal_mask refactor) is explicitly marked non-blocking and deferred to a follow-up that also re-enables TorchCompileTesterMixin. Comment cleanups: - transformer_anyflow.py:704 temb output-norm comment: drop redundant 'no ndim==2 branch'. - pipeline_anyflow.py:550 denoise loop comment: '# 6. Denoising loop'. - pipeline_anyflow_far.py:684 denoise loop comment: '# 8. Denoising loop (outer over chunks, inner over timesteps).'. - pipeline_anyflow_far.py:702 drop trailing inline comment on `timesteps = scheduler.timesteps`. - scheduling_flow_map_euler_discrete.py: clearer wording on the off-schedule `r_timestep` error. Custom schedule support: - FlowMapEulerDiscreteScheduler.set_timesteps gains `sigmas` and `timesteps` kwargs mirroring FlowMatchEulerDiscreteScheduler. Default behaviour is unchanged (linspace + shift); the validation + length-N → length-N+1 terminal-0 append are shared with the default path so on-schedule rollouts stay bit-exact. - AnyFlowPipeline.__call__ and AnyFlowFARPipeline.__call__ accept `sigmas` and `timesteps` kwargs, override num_inference_steps from their length, and forward to set_timesteps (matches LTX2Pipeline pattern). - New scheduler tests: test_set_timesteps_custom_sigmas and test_set_timesteps_custom_timesteps cover both override paths. Dtype skip on save/load: - TestAnyFlowTransformer3D and TestAnyFlowFARTransformer3D now skip test_from_save_pretrained_dtype_inference (parametrized over fp16/bf16), mirroring WanTransformer3DModel's skip — the test's tolerance requirements are too high for meaningful signal under AnyFlow's flow-map mixed-precision sampling. * [AnyFlow] docs: apply hf-doc-builder line wrap (max_len 119) CI doc-builder style check flagged 3 files with docstring lines >119 chars. Ran 'doc-builder style src/diffusers docs/source --max_len 119' to autoformat; content unchanged, line wrapping only. * [AnyFlow] apply 5/22 follow-up review: new_zeros terminal sigma + cleanup dg845 blocking suggestion (r3287274209): - scheduling_flow_map_euler_discrete.py:185 — use `working_sigmas.new_zeros(1)` instead of `torch.zeros(1, dtype=...)` so the appended terminal sigma inherits both device and dtype from working_sigmas. The current working_sigmas always starts on CPU so the device mismatch is latent, but new_zeros is the correct defensive pattern and matches how the published FAR test fixtures run on CUDA. Claude bot final-review follow-ups: - transformer_anyflow_far.py: drop three stale `# step 3: generate attention mask` comments left over from the original numbered-step structure (bot #6). - pipeline_anyflow_far.py: annotate `encode_video` with `# Copied from diffusers.pipelines.anyflow.pipeline_anyflow.AnyFlowPipeline.encode_video` and align docstring + inline comment so `make fix-copies` keeps them in sync (bot #3). Skipped (not real / judgment-call): - bot #2 (private rename of `_forward_far_patchify*`) — already done in 84605d5; bot was looking at a stale snapshot. - bot #4 (check_inputs `# Copied from`) — FAR's check_inputs has an extra `(num_frames - 1) % 4 == 0` constraint that doesn't map onto the bidi version, so a clean `# Copied from` link would require restructuring. Bot called it a consistency nit; leaving as-is. - bot #5 (`encode_kv_cache` → `_encode_kv_cache`) — bot itself flagged this as judgment-call territory; the helper is a coherent operation that advanced inference callers may want to invoke directly. --------- Co-authored-by: dg845 <58458699+dg845@users.noreply.github.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
…up (#13792) * [AnyFlow] FAR: standalone causal-mask builder + torch.compile follow-up Follow-up to #13745. Extracts FAR mask construction to a module-level helper and adds an `attention_mask` forward kwarg so AnyFlowFARTransformer3DModel can be wrapped in `torch.compile(fullgraph=True)`. The pipeline pre-builds the mask during KV-cache prefill so users get end-to-end fullgraph compile. * Public method `AnyFlowFARTransformer3DModel.build_attention_mask(...)` (modes: "train", "cache") plus private module-level helper `_build_anyflow_far_causal_block_mask(...)`. * `_build_freqs` cache lookup/write bypassed under `torch.compiler.is_compiling()` to avoid a Dynamo guard recompile on the second compiled call (applied in bidi source; synced to FAR via `# Copied from`). * `TestAnyFlowFARTransformer3DCompile(TorchCompileTesterMixin)` — recompilation_and_graph_break, repeated_blocks, and group_offloading pass on H200; AOT is `@pytest.mark.skip`'d (torch.export rejects BlockMask as a pytree input). * Base `get_dummy_inputs` omits `attention_mask` so every non-compile test class exercises the in-forward fallback; the compile class overrides to inject a pre-built mask. * Bit-exact: pre-built path vs internal-build fallback max|Δ|=0.0e+00. * [AnyFlow] docs: full author list, repo demo examples, slimmer pipeline page * Full author list and NVIDIA → NUS → MIT institution order; TL;DR + abstract + Available Models bullets. * Rewritten pipeline-selection tip describing both pipelines symmetrically. * T2V / I2V / V2V examples now use the canonical 81-frame setup and the demo prompts / conditioning assets shipped under `NVlabs/AnyFlow/assets/evaluation/` (linked via raw.githubusercontent.com). * Drop the inline "Optimizing Memory" and "torch.compile" sections — those notes will live in the NVlabs/AnyFlow repo's own performance guide rather than the diffusers pipeline reference. * Sync zh user guide and the two model-API stubs. * [AnyFlow] FAR: move chunk_partition default into transformer config - AnyFlowFARTransformer3DModel.__init__ now accepts chunk_partition via @register_to_config (default (1, 3, 3, 3, 3, 3, 3, 2) for the released 81-frame checkpoints, matching the field on Hub). - AnyFlowFARPipeline.__call__ no longer requires chunk_partition; defaults to self.transformer.config.chunk_partition. Per-call override still supported for V2V / non-default num_frames. - Drop the AnyFlowFARPipeline.default_chunk_partition class attribute. - Update docs (en pipelines/models, zh using-diffusers) and the conversion script to match. * [AnyFlow] FAR pipeline: fix `timesteps` shadowing across chunks Inside the per-chunk rollout loop, the local variable `timesteps` was reassigned to `self.scheduler.timesteps` after `set_timesteps()`. On the next chunk iteration the same name was passed back into `set_timesteps(timesteps=...)`, where a non-None value enters the *custom-schedule* branch — `apply_shift` re-runs on already-shifted values, double-shifting the schedule for every chunk after the first. Concretely, with `shift=5` and `num_inference_steps=4`: - chunk 0 timesteps: [1000, 937.5, 833.3, 625] (correct) - chunk 1+ timesteps: [1000, 986.8, 961.3, 892.9] (double-shifted) The later steps drift toward `t=1000` instead of toward `t=0`, the flow-map model is conditioned on the wrong source sigma, and the chunk KV cache accumulates errors that show up as artifacts in later video frames. Fix: rebind the cached schedule to a fresh local name (`scheduler_timesteps`) so the outer-scope `timesteps` kwarg (the user-provided custom schedule, when any) stays untouched across chunks. Layer-by-layer verification against the NVlabs reference implementation on H200 (elephant prompt, seed 0, 4 NFE, 81 frames): - chunk 0 inference: bit-exact (0.0 mean diff) - chunk 1 step 0: 0.194 → 0.014 (-93%) - chunk 7 last step: 0.564 → 0.274 (-51%) * [AnyFlow] FAR: doc-builder line wrap for chunk_partition docstrings Pure rewrap to satisfy `doc-builder style --max_len 119`. Two docstrings introduced in 96077b2 (the `chunk_partition` config arg on the FAR transformer + the matching pipeline kwarg) wrapped a few characters short of the line budget. No semantic change. * [AnyFlow] docs: drop author names from docstrings, link FAR via HF papers, say chunk-wise - Remove author-name attributions from the transformer / pipeline class docstrings and file-header comments; the paper-citation header on the doc page keeps the full author list, the in-code references just point at the [AnyFlow] / [FAR] papers. - Link FAR via its Hugging Face papers page (https://huggingface.co/papers/2503.19325) instead of a raw arxiv.org URL, matching the AnyFlow reference style and the rest of the diffusers docs. - Describe AnyFlow FAR generation as "chunk-wise autoregressive": the pipeline autoregresses over chunks (`chunk_partition`), not single frames. * [AnyFlow] FAR: address review nits - pipeline: reuse the standard `timesteps` variable name for the per-chunk scheduler timesteps; freeze the caller-provided custom schedule in `custom_timesteps`/`custom_sigmas` before the loop so it isn't re-fed into `set_timesteps` and double-shifted on later chunks. - transformer: clarify the no-mask fallback comment to spell out the `torch.compile(fullgraph=True)` graph-break behavior and the `build_attention_mask` workaround. --------- Co-authored-by: dg845 <58458699+dg845@users.noreply.github.com>
What does this PR do?
This PR adds pipelines for AnyFlow (paper, project page, official code, model weights), an any-step video diffusion framework built on flow maps. A single distilled checkpoint can be evaluated at 1, 2, 4, 8, 16, 32 NFE without retraining, and quality scales monotonically with steps — unlike consistency-based distillation, which often degrades as NFE grows.
Two new pipelines are added, both on top of a new
FlowMapEulerDiscreteSchedulerand reusingWanLoraLoaderMixin:AnyFlowPipeline→AnyFlowTransformer3DModel: bidirectional text-to-video built on the Wan2.1 backbone with anAnyFlowDualTimestepTextImageEmbeddingconditioning on the source/target timestep pair(t, r).AnyFlowFARPipeline→AnyFlowFARTransformer3DModel: frame-level autoregressive variant (block-sparse causalflex_attention+ KV cache + compressed-frame patch embedding) jointly handling T2V / I2V / V2V through onecontext_sequenceargument.Four checkpoints are released under the
nvidia/anyflowcollection (Wan2.1-T2V-{1.3B,14B}bidi +FAR-Wan2.1-{1.3B,14B}causal). All four have been validated bit-exact against the officialNVlabs/AnyFlowreference on H200: forward L2 =0.00e+00for scheduler / transformer / bidi pipeline / FAR pipeline; backward grad delta is4.88e-04, attributable to bf16 kernel non-determinism only (PR-vs-PR = PR-vs-reference, ratio1.000); inference latency matches the reference at ±0.0% on both pipelines.T2V inference example:
I2V inference example with the FAR pipeline (single conditioning frame → autoregressive rollout):
Documentation: EN tutorial at
docs/source/en/using-diffusers/anyflow.md, ZH tutorial atdocs/source/zh/using-diffusers/anyflow.md, and three API pages (pipelines + two transformer model pages). Tests: 22 fast tests (transformer + scheduler, CPU) plus four pipeline test files, with slow integration tests gated onRUN_SLOW=1 @require_torch_acceleratorfor the released checkpoints.anyflow-pr-presentation.mp4
Before submitting
Who can review?
@yiyixuxu @asomoza